Abstract

Artificial agents today can answer factual questions. But
they fall short on questions that require common sense reasoning.
Perhaps this is because most existing common sense
databases rely on text to learn and represent knowledge.
But much of common sense knowledge is unwritten - partly
because it tends not to be interesting enough to talk about,
and partly because some common sense is unnatural to articulate
in text. While unwritten, it is not unseen. In this paper
we leverage semantic common sense knowledge learned
from images - i.e. visual common sense - in two textual
tasks: fill-in-the-blank and visual paraphrasing. We propose
to “imagine” the scene behind the text, and leverage
visual cues from the “imagined” scenes in addition to textual
cues while answering these questions. We imagine the
scenes as a visual abstraction. Our approach outperforms a
strong text-only baseline on these tasks. Our proposed tasks
can serve as benchmarks to quantitatively evaluate progress
in solving tasks that go “beyond recognition”. Our code
and datasets will be made publicly available.

1. Introduction 

Today’s artificially intelligent agents are good at answering
factual questions about our world [9, 15, 41]. For
instance, Siri^1, Cortana^2, Google Now^3, Wolfram Alpha^4
etc., when asked “How far is the closest McDonald’s to
me?”, can comprehend the question, mine the appropriate
database (e.g. maps) and respond with a useful answer.
While being good at niche applications or answering factual
questions, today’s AI systems are far from being sapient
intelligent entities. Common sense continues to elude them.
Consider a simple fill-in-the-blank task shown in Fig- 

^1 https://www.apple.com/ios/siri/

^2 http://www.windowsphone.com/en-us/how-to/wp8/cortana/meet-cortana

^3 http://www.google.com/landing/now/

^4 http://www.wolframalpha.com/


Fill-in-the-blank: 


Mike is having lunch 
when he sees a bear. 


A. Mike orders a pizza. 

B. Mike hugs the bear.

C. Bears are mammals. 

D. Mike tries to hide. 


Visual Paraphrasing: 

Are these two descriptions 

describing the same scene? 

1. Mike had his baseball 
hat at the park. Jenny 
was going to throw her 
pie at Mike. Mike was 
upset he didn’t want 
Jenny to hit him with a 
pie. 

2. Mike is holding a bat. 
Jenny is very angry. 
Jenny is holding a pie. 


Figure 1. We introduce two tasks: fill-in-the-blank (FITB) and vi¬ 
sual paraphrasing (VP). While they seem like purely textual tasks, 
they require some imagination - visual common sense - to answer. 


ure 1 (left). Answering this question requires the common 
sense that bears are dangerous animals, people like to stay 
away from and not be noticed by dangerous animals, and 
hiding is one way of going unnoticed. Similarly, consider 
the visual paraphrasing question in Figure 1 (right). An¬ 
swering this question involves common sense that people 
might throw things when they are angry. Today’s systems 
are unable to answer such questions reliably. 

Perhaps this is not surprising. Most existing common
sense knowledge bases rely on knowledge described via
text - either mined [6, 24, 29] or manually entered [33,
39, 5, 40]. There are a few shortcomings of learning common
sense from text. First, it has been shown that people
tend not to explicitly talk about common sense knowledge
in text [18]. Instead, there is a bias to talk about unusual
circumstances, because those are worth talking about.
Co-occurrence statistics of visual concepts mined from the web
have been shown to not generalize to images [31]. Even when
describing images, text is likely to talk about the salient
“foreground” objects, activities, etc. But common sense
reveals itself even in the “background”. Second, much of
useful common sense knowledge may be hard to describe
in text. For instance, the knowledge that “one person is running
after another person” implies that the first person is
facing the second person, the second person is looking in
the same direction as the first person, and both people are in
running poses, is unnatural (and typically unnecessary) to
articulate in text.

Fortunately, much of this common sense knowledge is 
depicted in our visual world. We call such common sense 
knowledge that can be learnt from visual data visual com¬ 
mon sense. By visual common sense we do not mean visual 
models of commonly occurring interactions between ob¬ 
jects [10] or knowledge of visual relationships between ob¬ 
jects, parts and attributes [8, 44]. We mean semantic com¬ 
mon sense, e.g. the knowledge that if one person is running 
after another person, and the second person turns around, 
he will see the first person. It can be learnt from visual data 
but can help in a variety of visual and non-visual AI tasks. 
Such visual common sense is complementary to common 
sense learnt from non-visual sources. 

We argue that the tasks shown in Figure 1 may look 
like purely text- or language-based tasks on the surface, but 
they can benefit from visual common sense. In fact, we 
go further and argue that such tasks can provide exciting 
new benchmarks to evaluate image understanding “beyond 
recognition”. Effectively learning and applying visual com¬ 
mon sense to such tasks involves challenges such as ground¬ 
ing language in vision and learning common sense from vi¬ 
sual data - both steps towards deeper image understanding 
beyond naming objects, attributes, parts, scenes and other 
image content depicted in the pixels of an image. 

In this work we propose two tasks: fill-in-the-blank 
(FITB) and visual paraphrasing (VP) - as seen in Figure 1 
- that can benefit from visual common sense. We propose 
an approach to address these tasks that first “imagines” the 
scene behind the text. It then reasons about the generated 
scenes using visual common sense, as well as the text using 
textual common sense, to identify the most likely solution 
to the task. In order to leverage visual common sense, this 
imagined scene need not be photo-realistic. It only needs to 
encode the semantic features of a scene (which objects are 
present, where, what are their attributes, how are they inter¬ 
acting, etc.). Hence, we imagine our scenes in an abstract 
representation of our visual world - in particular using cli¬ 
part [45, 46, 17, 1]. 

Specifically, given an FITB task with four options, we 
generate a scene corresponding to each of the four descrip¬ 
tions that can be formed by pairing the input description 
with each of the four options. We then apply a learnt model 
that reasons jointly about text and vision to select the most 
plausible option. Our model essentially uses the generated 
scene as an intermediate representation to help solve the 
task. Similarly, for a VP task, we generate a scene for each 


of the two descriptions, and apply a learnt joint text and vi¬ 
sion model to classify both descriptions as describing the 
same scene or not. We introduce datasets for both tasks. 
We show that our imagination-based approach that lever¬ 
ages both visual and textual common sense outperforms the 
text-only baseline on both tasks. Our datasets and code will 
be made publicly available. 

2. Related Work 

Beyond recognition: Higher-level image understand¬ 
ing tasks go beyond recognizing and localizing objects, 
scenes, attributes and other image content depicted in the 
pixels of the image. Example tasks include reasoning about 
what people talk about in images [4], understanding the 
flow of time (when) [35], identifying where the image is 
taken [22, 26] and judging the intentions of people in im¬ 
ages (why) [36]. While going beyond recognition, these 
tasks are fairly niche. Approaches that automatically pro¬ 
duce a textual description of images [20, 13, 27] or synthe¬ 
size scenes corresponding to input textual descriptions [46] 
can benefit from reasoning about all these different “W” 
questions and other high-level information. They are se¬ 
mantically more comprehensive variations of beyond recog¬ 
nition tasks that test high-level image understanding abili¬ 
ties. However, these tasks are difficult to evaluate [27, 12] 
or often evaluate aspects of the problem that are less rele¬ 
vant to image understanding e.g. grammatical correctness of 
automatically generated descriptions of images. This makes 
it difficult to use these tasks as benchmarks for evaluating 
image understanding beyond recognition. 

Leveraging visual common sense in our proposed FITB
and VP tasks requires qualitatively a similar level of image
understanding as in image-to-text and text-to-image tasks.
FITB requires reasoning about what else is plausible in a
scene given a partial textual description. VP tasks on the
other hand require us to reason about how multiple descriptions
of the same scene could vary. At the same time, FITB
and VP tasks are multiple-choice questions and hence easy
to evaluate. This makes them desirable benchmark tasks for
evaluating image understanding beyond recognition.

Natural language Q&A: Answering factual queries in
natural language is a well studied problem in text retrieval.
Given questions like “Through which country does the
Yenisei river flow?”, the task is to query useful information
sources and give a correct answer, for example “Mongolia”
or “Russia”. Many systems such as personal assistant
applications on phones and IBM Watson [15], which won
the Jeopardy! challenge, have achieved commercial success.
There are also established challenges on answering factual
questions posed by humans [9], natural language knowledge
base queries [41] and even university entrance exams
[34]. The FITB and VP tasks we study are not about facts,
but common sense questions.


Leveraging common sense: Common sense is an important
element in solving many beyond recognition tasks,
since beyond recognition tasks tend to require information
that is outside the boundaries of the image. It has been
shown that learning and using non-visual common sense
(i.e. common sense learnt from non-visual sources) benefits
physical reasoning [21, 43], reasoning about intentions [36]
and object functionality [44]. One instantiation of visual
common sense that has been leveraged in the vision community
in the past is the use of contextual reasoning for improved
recognition [20, 11, 19, 16, 23, 44]. In this work, we
explore the use of visual common sense for seemingly non-visual
tasks through “imagination”, i.e. generating scenes.

Synthetic data: Learning from synthetic data avoids te¬ 
dious manual labeling of real images. It also provides a plat¬ 
form to study high-level image understanding tasks with¬ 
out having to wait for low-level recognition problems to be 
solved. Moreover, synthetic data can be collected in large 
amounts and with high density, allowing us to learn rich 
models. Previous works have looked at learning recogni¬ 
tion models from synthetic data. For instance, computer 
graphics models were used to synthesize data to learn hu¬ 
man pose [38] and chair models [2]. Clipart data has been 
used to learn models of fine-grained interactions between 
people [1]. [30] warps images of one category to use them 
as examples for other categories. [25] uses synthetic im¬ 
ages to evaluate low-level image features. Human-created 
clipart images have been used to learn which semantic fea¬ 
tures (occurrence or co-occurrence of objects, pose, expres¬ 
sion, relative location, etc.) are relevant to the meaning of 
a scene [45] and to learn spatio-temporal common sense to 
model scene dynamics [17]. In this work, we learn our mod¬ 
els from human-created clipart scenes. We also use clipart 
to “imagine” scenes in order to solve the FITB and VP tasks. 
Though the abstract scenes [45] are not photo-realistic, they 
offer, more importantly, a semantically rich world where 
one can effectively generate scenes and learn semantic vari¬ 
ations of sentences and scenes, free from the bottlenecks of 
(still) imperfect object recognition and detection. Despite
being synthetic, it has been shown that semantic concepts
learnt from abstract scenes can generalize to real images [1].

3. Dataset 

We build our FITB and VP datasets on top of the Abstract
Scenes Dataset^5, which has 10,020 human-created
abstract scenes of a boy and a girl playing in the park. The
dataset contains 58 clipart objects including the boy (Mike),
the girl (Jenny), toys, background objects like trees and
clouds, animals like dogs and cats, food items like burgers
and pizzas, etc. A subset of these objects is placed
in the scene at a particular location, scale, and orientation
(facing left or right). The boy and the girl can have different
poses (7) and expressions (5). Each one of the 10,020
scenes has textual descriptions written by two different people.
We use this clipart as the representation within which
we will “imagine” our scenes. We also use this dataset to
learn visual common sense. While more clipart objects, expressions,
poses, etc. can enable us to learn more comprehensive
visual common sense, this dataset has been shown
to contain semantically rich information [45, 46], sufficient
to begin exploring our proposed tasks. We now describe our
approach to creating our FITB and VP datasets.

^5 http://research.microsoft.com/en-us/um/people/larryz/clipart/abstract_scenes.html

3.1. Fill-in-the-blank (FITB) Dataset 

Every description in the Abstract Scenes Dataset con¬ 
sists of three short sentences, typically describing differ¬ 
ent aspects of the scene while also forming a coherent de¬ 
scription. Since we have two such descriptions for every 
scene, we arbitrarily place one of the two descriptions (for 
all scenes) into the source set and the other into the distrac- 
tor set. For each image, we randomly drop one sentence 
from its source description to form an FITB question. We 
group this dropped sentence with 3 random sentences from 
descriptions of other images in the distractor set. The FITB 
task is to correctly identify which sentence in the options 
belongs to the original description in the question. 
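
The construction described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' code: the function name `build_fitb_questions`, the dictionary layout (scene id mapped to a list of sentences) and the `seed` argument are assumptions made for the example.

```python
import random


def build_fitb_questions(source, distractor, num_distractors=3, seed=0):
    """Build FITB questions from a source set and a distractor set.

    source, distractor: dict mapping scene id -> list of sentences
    (one description per scene in each set). Hypothetical data layout.
    """
    rng = random.Random(seed)
    questions = []
    for scene_id, sentences in source.items():
        # Randomly drop one sentence; it becomes the ground-truth option.
        drop = rng.randrange(len(sentences))
        answer = sentences[drop]
        body = sentences[:drop] + ["______"] + sentences[drop + 1:]
        # Distractors are sentences describing *other* scenes.
        pool = [s for other, sents in distractor.items()
                if other != scene_id for s in sents]
        options = rng.sample(pool, num_distractors) + [answer]
        rng.shuffle(options)
        questions.append({
            "scene": scene_id,
            "body": body,
            "options": options,
            "answer_index": options.index(answer),
        })
    return questions
```

Each returned question holds the description with one blank, four shuffled options, and the index of the dropped sentence among them.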

Removing questions where the NLP parser produced de¬ 
generate outputs, our resulting FITB dataset contains 8,959 
FITB questions - 7,198 for training and 1,761 for test¬ 
ing. Figure 3 shows one example FITB question from our 
dataset. The scenes corresponding to the questions in the 
training set are available for learning visual common sense 
and text-image correspondence. The scenes corresponding 
to the test questions are not available at test time. 

FITB is a challenging task. Many scenes share the same 
visual elements such as Mike and Jenny playing football. 
Sometimes the distractor options may seem just as valid as 
the ground truth option, even to humans. We conduct stud¬ 
ies on human performance on the test set. We had 10 dif¬ 
ferent subjects on Amazon Mechanical Turk (AMT) answer 
the FITB questions. To closely mimic the task given to ma¬ 
chines, subjects were not shown the corresponding image. 
We found that the majority vote response (i.e. mode of re¬ 
sponses) across 10 subjects agreed with the ground truth 
52.87% of the time (compared to random guessing at 25%). 

Some questions have disagreements among the subjects, 
while other questions have consistent responses across sub¬ 
jects. We find that 41% of the questions in our dataset have 
7 or more subjects agreeing on the response. Of these ques¬ 
tions, the mode of the responses across subjects agrees with 
the ground truth 69% of the time. Interestingly, on the re¬ 
maining 31% of the questions, 7 out of 10 subjects agree on 
the wrong response. In our experiments, we report accura- 
Figure 2. Human performance vs. inter-human agreement on the
FITB task. Mode of human responses is more accurate when sub¬ 
jects agree with each other. 

cies relative to the ground truth response, as well as relative 
to the response that most subjects agree on (the latter might 
be more relevant from an AI perspective - if the goal is to 
produce human-like responses). 

In Figure 2, we consider different subsets of the dataset 
formed by only considering questions where a certain mini¬ 
mum proportion of subjects agreed on the response (human 
agreement). For each subset, we can evaluate the accuracy 
of the mode response. We also look at what percentage of 
the dataset falls in each subset. Not surprisingly, human ac¬ 
curacy (mode agreeing with ground truth) correlates well 
with human agreement (percentage of subjects that agree 
with mode). Note that even if responses were random, on 
average 43% of subjects would agree on the mode response. 
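
The chance level of agreement mentioned above can be checked by exact enumeration. The sketch below assumes that each of the 10 subjects picks one of the 4 options uniformly and independently, and that "agreement" means the fraction of subjects choosing the mode; the function name is hypothetical.

```python
from itertools import product
from math import factorial


def expected_mode_agreement(num_subjects=10, num_options=4):
    """Expected fraction of subjects that pick the most common option
    when every subject answers uniformly at random."""
    total = 0.0
    for counts in product(range(num_subjects + 1), repeat=num_options):
        if sum(counts) != num_subjects:
            continue
        # Multinomial coefficient: number of ways to obtain these counts.
        ways = factorial(num_subjects)
        for c in counts:
            ways //= factorial(c)
        prob = ways / num_options ** num_subjects
        total += prob * max(counts) / num_subjects
    return total
```

Calling `expected_mode_agreement()` gives a value that can be compared against the 43% figure quoted in the text.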

3.2. Visual Paraphrasing (VP) Dataset 

The VP task is to tell if two descriptions are describing 
the same scene or two different scenes. The correct answer 
to a pair of descriptions written by two people describing 
the same scene is “Yes”, while to randomly drawn descrip¬ 
tions from two different scenes is “No”. 

We build our VP dataset using all 10,020 scenes from the 
Abstract Scenes Dataset, resulting in a dataset with 10,020 
positive pairs. We randomly sample 2 x 10,020 pairs as 
negatives. This leads to a total of 30,060 questions in our 
dataset. Of these, 24,000 are used for training and the rest 
6,060 are used for testing. We choose the negative pairs 
separately in training and testing sets such that they do not 
overlap with each other. Figure 4 shows one example VP 
question from our dataset. 
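
As a rough sketch of how such pairs could be assembled (not the authors' code), the snippet below pairs the two descriptions of each scene as positives and samples twice as many cross-scene pairs as negatives. The function name, the data layout and the +1/-1 labels are assumptions for illustration; the train/test split is not shown, and at least two scenes are required.

```python
import random


def build_vp_pairs(descriptions, negatives_per_scene=2, seed=0):
    """descriptions: dict scene id -> (description_a, description_b)."""
    rng = random.Random(seed)
    scene_ids = list(descriptions)
    # Positive pairs: two descriptions of the same scene (label +1).
    pairs = [(a, b, 1) for a, b in descriptions.values()]
    # Negative pairs: descriptions from two different scenes (label -1).
    wanted = negatives_per_scene * len(scene_ids)
    seen = set()
    while len(seen) < wanted:
        s1, s2 = rng.sample(scene_ids, 2)  # two distinct scenes
        seen.add((s1, rng.randrange(2), s2, rng.randrange(2)))
    for s1, i, s2, j in sorted(seen):
        pairs.append((descriptions[s1][i], descriptions[s2][j], -1))
    return pairs
```

With 10,020 scenes this yields 10,020 positives and 20,040 negatives, i.e. 30,060 pairs in total.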

We evaluate human performance on our test set. We had 
10 different subjects on AMT solve our tasks. We average 
their responses (0 for No and 1 for Yes) to obtain a score 
between 0 and 1 for each question. We can use this score 
to plot a precision-recall curve. Results show that humans 
can reliably solve this task with 94.78% average precision 
(AP), compared to chance at 33%. 

FITB and VP tasks are ways to evaluate visual common 


sense. Some applications of FITB tasks may be automatic 
story telling and automatic Q&A. Some applications of the 
VP task may be text-based image retrieval and generating 
multiple diverse descriptions of the same image. 

4. Approach 

We first (Section 4.1) describe the strong baseline ap¬ 
proach of using textual features (common sense) to solve 
the FITB and VP tasks. We then describe our visual com¬ 
mon sense model (Section 4.2.2) and scene generation ap¬ 
proach (Section 4.3). Finally in Section 4.4 we describe 
our approach to using our model to solve the FITB and VP 
tasks. 

4.1. Text Only Model 

We first tokenize all words in our dataset and form a 
vocabulary (1,886 words for the FITB dataset and 2,495 
for the VP dataset). We also form a vocabulary of pairs 
of words by selecting 100 pairs of words which have the 
highest mutual information in the training data and co-occur 
more than 100 times. 

Both FITB and VP involve reasoning about consistency
between two descriptions (question and option for FITB
and two input descriptions for VP). Given two descriptions
$d_1$ and $d_2$, we extract three kinds of textual features from
the pair. The first is term frequency, commonly used for
text classification and retrieval, which counts how often
each word from our vocabulary occurs in $(d_1, d_2)$ (both
descriptions concatenated). The second is a 400D word
co-occurrence vector indicating for each (of the 100) pair of
words whether: (i) the first word occurred in $d_1$ and the
second word occurred in $d_2$ or (ii) the first word occurred in
$d_1$ and the second word did not occur in $d_2$ or (iii) the first
word did not occur in $d_1$ and the second word occurred in
$d_2$ or (iv) the first word did not occur in $d_1$ and the second
word did not occur in $d_2$. The third uses a state-of-the-art
deep learning based word embedding representation learnt
from a large text corpus. We use word2vec [32] to represent
each word with a (default) 200D vector. We then average
the vector responses of all words in $(d_1, d_2)$. These features
capture common sense knowledge about which words
are used interchangeably to describe the same thing, which
words tend to co-occur in descriptions, etc.
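
The three feature types can be illustrated with a small, self-contained sketch. It is not the authors' implementation: the tokenizer, the function names, skipping words without an embedding, and passing embeddings as a plain dictionary are all assumptions made for the example.

```python
def tokenize(text):
    # Assumption: lowercase whitespace tokens with edge punctuation removed.
    tokens = [t.strip(".,!?").lower() for t in text.split()]
    return [t for t in tokens if t]


def text_features(d1, d2, vocab, word_pairs, embeddings, dim):
    t1, t2 = tokenize(d1), tokenize(d2)
    both = t1 + t2
    # (1) Term frequency over the two descriptions concatenated.
    tf = [both.count(word) for word in vocab]
    # (2) Four indicators per word pair, in the order (i)-(iv) of the text.
    cooc = []
    for first, second in word_pairs:
        a, b = first in t1, second in t2
        cooc += [int(a and b), int(a and not b),
                 int(not a and b), int(not a and not b)]
    # (3) Average embedding of all words that have a vector.
    vectors = [embeddings[t] for t in both if t in embeddings]
    if vectors:
        mean = [sum(v[k] for v in vectors) / len(vectors) for k in range(dim)]
    else:
        mean = [0.0] * dim
    return tf + cooc + mean
```

With the FITB vocabulary of 1,886 words, 100 word pairs and 200D embeddings, the concatenated vector has 1,886 + 400 + 200 = 2,486 dimensions.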

Fill-in-the-blank. For $N$ fill-in-the-blank questions and
$M$ options per question, we denote the question body as
$q_i$, $i \in \{1, \ldots, N\}$ and the options for $q_i$ as
$o_{ij}$, $j \in \{1, \ldots, M\}$. We denote the ground truth
option for question $q_i$ as $o_i^{gt}$, and its index as $j_i^{gt}$.

The FITB problem is a ranking problem: given $q_i$, we
wish to rank the correct option $o_i^{gt}$ above distractors
$o_{ij}$, $j \neq j_i^{gt}$. For each question-option pair
$(q_i, o_{ij})$, we extract the three kinds of textual features as
described above using $d_1 = q_i$ and $d_2 = o_{ij}$. Concatenating
these three gives us a 2,486D text feature vector
$\phi^{text}_{fitb}(q_i, o_{ij})$. We compute scores
$s_{ij} = w^T \phi^{text}_{fitb}(q_i, o_{ij})$ for each option that
capture how likely $o_{ij}$ is to be the answer to $q_i$. We then
pick the option with the highest score. We learn $w$ using a ranking
SVM [7]:

$$\min_{w,\ \xi \geq 0}\ \frac{1}{2}\|w\|_2^2 + C \sum_{(i,j),\ j \neq j_i^{gt}} \xi_{ij}$$

$$\text{s.t.}\quad w^T \phi^{text}_{fitb}(q_i, o_i^{gt}) - w^T \phi^{text}_{fitb}(q_i, o_{ij}) \geq 1 - \xi_{ij}, \quad \forall (i,j),\ j \neq j_i^{gt} \tag{1}$$
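
To make the ranking formulation concrete, the sketch below scores the options of one question with a weight vector and evaluates the hinge slack that each distractor contributes to the ranking SVM objective. It is a minimal illustration with hypothetical function names, not a solver; learning the weights themselves would be done with a ranking SVM package.

```python
def dot(w, x):
    return sum(a * b for a, b in zip(w, x))


def pick_option(w, option_features):
    """Return the index of the option with the highest score w . phi."""
    scores = [dot(w, phi) for phi in option_features]
    return max(range(len(scores)), key=scores.__getitem__)


def ranking_slack(w, option_features, gt_index):
    """Sum of hinge slacks max(0, 1 - (s_gt - s_j)) over all distractors j."""
    s_gt = dot(w, option_features[gt_index])
    return sum(max(0.0, 1.0 - (s_gt - dot(w, phi)))
               for j, phi in enumerate(option_features) if j != gt_index)
```

A slack of zero for a question means the ground truth option beats every distractor by a margin of at least 1.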

Visual paraphrasing. In visual paraphrasing, for each
question $i$, the goal is to verify if the two given descriptions
$q_{i1}$ and $q_{i2}$ describe the same image ($y_i = 1$) or not
($y_i = -1$). We extract all three features described above using
$d_1 = q_{i1}$ and $d_2 = q_{i2}$. Let’s call this $\phi^{text}_{vp1}$.
We extract the same features but using $d_1 = q_{i2}$ and
$d_2 = q_{i1}$. Let’s call this $\phi^{text}_{vp2}$. To ensure that the
final feature representation is symmetric - i.e.
$\phi^{text}_{vp}(q_{i1}, q_{i2}) = \phi^{text}_{vp}(q_{i2}, q_{i1})$ -
we use $\phi^{text}_{vp} = [\phi^{text}_{vp1} + \phi^{text}_{vp2},\ |\phi^{text}_{vp1} - \phi^{text}_{vp2}|]$,
i.e. a concatenation of the summation of $\phi^{text}_{vp1}$ and
$\phi^{text}_{vp2}$ and the absolute difference between the two.
This results in a (2 x 2,495) + (2 x 200) + (2 x 400) = 6,190D
feature vector describing $(q_{i1}, q_{i2})$. We then train a binary
linear SVM to verify whether the two descriptions are describing
the same image or not.


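$$\min_{w,\ \xi \geq 0}\ \frac{1}{2}\|w\|_2^2 + C \sum_{i} \xi_{i}$$

$$\text{s.t.}\quad y_i\, w^T \phi^{text}_{vp}(q_{i1}, q_{i2}) \geq 1 - \xi_{i}, \quad \forall i$$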
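
The symmetrization step is easy to verify in code. The sketch below (the function name is hypothetical) concatenates the element-wise sum and the element-wise absolute difference of the two ordered feature vectors, so swapping the two descriptions leaves the result unchanged.

```python
def symmetric_features(phi_12, phi_21):
    """Concatenate the sum and the absolute difference of two feature
    vectors extracted with the descriptions in either order."""
    total = [a + b for a, b in zip(phi_12, phi_21)]
    diff = [abs(a - b) for a, b in zip(phi_12, phi_21)]
    return total + diff
```

The output is twice as long as each input, which matches the 2 x (2,495 + 200 + 400) = 6,190 dimensions reported for VP.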


4.2. Incorporating Visual Common Sense

Our model extends the baseline text-only model (Section 4.1)
by using an “imagined” scene as an intermediate
representation. “Imagining” a scene involves setting values
for all of the variables (e.g. presence of objects, their
location) that are used to encode scenes. This encoding, along
with priors within this abstraction that reason about which
scenes are plausible, serves as our representation of visual
common sense. This is in contrast with traditional knowledge
base representations used to encode common sense via
text [44, 36]. Exploring alternative representations of visual
common sense is part of future work.

Given a textual description $S_i$, we generate a scene $I_i$.
We first describe our scoring function that scores the plausibility
of the $(S_i, I_i)$ pair. We then (Section 4.3) describe
our scene generation approach. Our scoring function

$$\Omega(I_i, S_i) = \Phi(S_i) + \Phi(I_i) + \Psi(I_i, S_i) \tag{2}$$

captures textual common sense, visual common sense
and text-image correspondence. The textual common sense
term $\Phi(S_i) = w^T \phi^{text}(S_i)$ only depends on text and is the
same as the text-only baseline model (Section 4.1). Of the
two new terms, $\Phi(I_i)$ only depends on the scene and captures
visual common sense - it evaluates how plausible the
scene is (Section 4.2.2). Finally, $\Psi(I_i, S_i)$ depends on both
the text description and the scene, and captures how consistent
the imagined scene is with the text (Section 4.2.3). We
start by describing the representation we use to represent the
description and to encode a scene via visual abstractions.

4.2.1 Scene and Description Encoding

The set of clipart in our visual abstraction was described
in Section 3. More details can be found in [45]. In the generated
scenes, we represent an object $O_k$ using its presence
$e_k \in \{0, 1\}$, location $x_k, y_k$, depth $z_k$ (3 discrete scales),
horizontal facing direction or orientation $d_k \in \{-1, 1\}$ (left
or right) and attributes $f_k$ (poses and expressions for the
boy and girl). The sentence descriptions $S_i$ are represented
using a set of predicate tuples $T_l$ extracted using semantic
roles analysis [37]. A tuple $T_l$ consists of a primary noun
$A_l$, a relation $r_l$ and an optional secondary noun $B_l$. For
example a tuple can be (Jenny, fly, Kite) or (Mike, be angry,
N/A). There are 1,133 nouns and 2,379 relations in our
datasets. Each primary noun $A_l$ and secondary noun $B_l$ is
mapped to 1 of 58 objects $a_l$ and $b_l$ respectively which have
the highest mutual information with it in training data. We
found this to work reliably.

4.2.2 Visual Common Sense

We break down the factors in $\Phi(I_i)$ into per-object
(unary) factors $\Phi^{u}(O_k)$ and between-object (pairwise)
factors $\Phi^{pw}(O_{k_1}, O_{k_2})$:

$$\Phi(I_i) = \sum_{k} \Phi^{u}(O_k) + \sum_{k_1, k_2} \Phi^{pw}(O_{k_1}, O_{k_2}) \tag{3}$$

Per-object (unary) factors $\Phi^{u}(O_k)$ capture presence,
location, depth, orientation and attributes. This scoring function
will be parameterized by $w$’s^6 that are shared across
all objects and pairs of objects. Let $L$ be the log probabilities
(MLE counts) estimated from training data. For example,
$L^{u}_{e}(e_k) = \log P(e_k)$, where $P(e_k)$ is the proportion of
images in which object $O_k$ exists, and
$L^{u}_{xyz}(x_k, y_k | z_k) = \log P(x_k, y_k | z_k)$, where
$P(x_k, y_k | z_k)$ is the proportion of times object $O_k$ is at
location $(x_k, y_k)$ given that $O_k$ is at depth $z_k$.

$$\Phi^{u}(O_k) = w^{u}_{e} L^{u}_{e}(e_k) + w^{u}_{xyz} L^{u}_{xyz}(x_k, y_k | z_k) + w^{u}_{z} L^{u}_{z}(z_k) + w^{u}_{d} L^{u}_{d}(d_k) + w^{u}_{f} L^{u}_{f}(f_k) \tag{4}$$
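
As an illustration of how such log-probability terms can be estimated from counts, the sketch below computes the presence term for one object and sums it over all objects of a scene. It covers only the presence part of the unary factor; the function names, the representation of a scene as a set of object ids, and the `eps` floor that avoids log(0) are assumptions made for the example.

```python
import math


def log_presence(scenes, obj, present=True, eps=1e-6):
    """log P(e_k): log of the fraction of training scenes in which
    object `obj` is present (or absent, if present=False)."""
    hits = sum(1 for s in scenes if (obj in s) == present)
    return math.log(max(hits / len(scenes), eps))


def presence_score(scene, scenes, objects, w_e=1.0):
    """Presence-only part of the unary term, summed over all objects."""
    return sum(w_e * log_presence(scenes, k, present=(k in scene))
               for k in objects)
```

A scene that contains commonly present objects and omits rare ones receives a higher (less negative) score under this term.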

Between-object (pairwise) factors $\Phi^{pw}(O_{k_1}, O_{k_2})$
capture co-occurrence of objects and their attributes, as well as
relative location, depth and orientation.

$$\Phi^{pw}(O_{k_1}, O_{k_2}) = w^{pw}_{e} L^{pw}_{e}(e_{k_1}, e_{k_2}) + w^{pw}_{xyd} L^{pw}_{xyd}(dx, dy) + w^{pw}_{z} L^{pw}_{z}(z_{k_1}, z_{k_2}) + w^{pw}_{d} L^{pw}_{d}(d_{k_1}, d_{k_2}) + w^{pw}_{f} L^{pw}_{f}(f_{k_1}, f_{k_2}) \tag{5}$$

^6 Overloaded notation with parameters learnt for the text-only baseline
in Section 4.1



Here the relative x-location is relative to the orientation
of the first object, i.e. $dx = d_{k_1}(x_{k_1} - x_{k_2})$. Relative
y-location is $dy = y_{k_1} - y_{k_2}$. These capture where $O_{k_2}$ is from
the perspective of $O_{k_1}$. The space of $(x, y)$ locations is quite large
(typical image size is 500 x 400). So to estimate the probabilities
reliably, we model the locations with GMMs. In
particular, the factor $L^{u}_{xyz}(x_k, y_k | z_k)$ is over 27 GMM
components and $L^{pw}_{xyd}(dx, dy)$ is over 24 GMM components.

Notice that since the parameters are shared across all objects
and pairs of objects, so far we have introduced 5 parameters
in Equation 4 and 5 parameters in Equation 5. The
corresponding 10 log-likelihood terms can be thought of as
features representing visual common sense. The parameters
will be learnt to optimize for the FITB (ranking SVM)
or VP (binary SVM) tasks similar to the text-only baseline
described in Section 4.1.

4.2.3 Text-Image Consistency 

We now discuss terms in our model that score the consistency
between an imagined scene and a textual description.
We break down the text-image correspondence
factors in $\Psi(I_i, S_i)$ in Equation 2 into per-noun
factors $\Psi^{n+}(I_i, T_l)$ and per-relation factors
$\Psi^{r+}(I_i, T_l)$ for objects that are mentioned in the
description, and default per-object factors $\Psi^{u-}(O_k)$ and
default between-object factors $\Psi^{pw-}(O_{k_1}, O_{k_2})$ when
the respective objects are not mentioned in the description.

$$\Psi(I_i, S_i) = \sum_{T_l \in S_i} \left[ \Psi^{n+}(I_i, T_l) + \Psi^{r+}(I_i, T_l) \right] + \sum_{k \notin S_i} \Psi^{u-}(O_k) + \sum_{k_1, k_2 \notin S_i} \Psi^{pw-}(O_{k_1}, O_{k_2}) \tag{6}$$

The per-noun factors $\Psi^{n+}(I_i, T_l)$ capture object presence
conditioned on the nouns (both primary and secondary)
in the tuple, and object attributes conditioned on
the nouns as well as relations in the tuple. For instance,
if the tuple $T_l$ is “(Jenny, kicks, ball)”, these terms reason
about the likelihood that Jenny and ball exist in the scene,
that Jenny has a certain attribute (e.g. kicking pose), etc.
Again, the likelihood of each concept is scored by its log
probability in the training data.

$$\Psi^{n+}(I_i, T_l) = w^{n+}_{abe} \left( L^{n+}_{abe}(e_{a_l} | a_l) + L^{n+}_{abe}(e_{b_l} | b_l) \right) + w^{n+}_{arf} L^{n+}_{arf}(f_{a_l} | a_l, r_l) + w^{n+}_{brf} L^{n+}_{brf}(f_{b_l} | b_l, r_l) \tag{7}$$

The per-relation factors $\Psi^{r+}(I_i, T_l)$ capture relative
object location (where is $b_l$ relative to $a_l$ and vice versa), depth
and orientation conditioned on the relation. Note that these
factors are shared across all objects because “wearing” in
(Mike, wears, hat) and (bear, wears, crown) is expected to
have similar visual instantiations.

$$\Psi^{r+}(I_i, T_l) = w^{r+}_{rxyd} L^{r+}_{rxyd}(dx, dy | r_l) + w^{r+}_{rxyd'} L^{r+}_{rxyd'}(dx', dy' | r_l) + w^{r+}_{rz} L^{r+}_{rz}(z_{a_l}, z_{b_l} | r_l) + w^{r+}_{rd} L^{r+}_{rd}(d_{a_l}, d_{b_l} | r_l) \tag{8}$$

Here $dx' = d_{b_l}(x_{b_l} - x_{a_l})$ and $dy' = y_{b_l} - y_{a_l}$ capture
where the primary object is relative to the secondary object.

[Figure 3 content]
Question: ______. Mike is wearing a blue cap. Mike is telling Jenny to get off the swing
Options and generated scenes:
A. There is a tree near a table.
B. The brown dog is standing next to Mike.
C. The sun is in the sky.
D. Jenny is standing dangerously on the
Answers: Ground truth: D. Vision + text: D. Text alone: A.

Figure 3. Scenes generated for an example FITB question.

The default per-object factors $\Psi^{u-}(O_k)$ and the default
between-object factors $\Psi^{pw-}(O_{k_1}, O_{k_2})$ capture default
statistics when an object or a pair of objects is not
mentioned in the description. $\Psi^{u-}(O_k)$ captures the default
presence and attribute whereas $\Psi^{pw-}(O_{k_1}, O_{k_2})$ captures
the default relative location, depth and orientation.

The default factors are object-specific since each object
has a different prior depending on its semantic role in
scenes. The default factors capture object states conditioned
on the object not being mentioned in a description. We use
notation $D$ instead of $L$ to stress this point. For example
$D^{u-}_{abe}(e_k | S_i) = \log P(e_k | k \notin S_i)$ and
$D^{pw-}_{z}(z_{k_1}, z_{k_2} | S_i) = \log P(z_{k_1}, z_{k_2} | k_1, k_2 \notin S_i)$.

$$\Psi^{u-}(O_k) = w^{u-}_{abe} D^{u-}_{abe}(e_k | S_i) + w^{u-}_{abf} D^{u-}_{abf}(f_k | S_i)$$

$$\Psi^{pw-}(O_{k_1}, O_{k_2}) = w^{pw-}_{xyd} D^{pw-}_{xyd}(dx, dy | S_i) + w^{pw-}_{z} D^{pw-}_{z}(z_{k_1}, z_{k_2} | S_i) + w^{pw-}_{d} D^{pw-}_{d}(d_{k_1}, d_{k_2} | S_i) \tag{9}$$

We have now introduced an additional 12 w parameters 
(total 22) that are to be learnt to solve the FITB and VP 
tasks. Notice that this is in stark contrast with the thou¬ 
sands of parameters we learn for the text-only baseline (Sec¬ 
tion 4.1). 

4.3. Scene Generation 

Given an input description, we extract tuples as described
earlier in Section 4.2.1. We then use the approach
of Zitnick et al. [46] to generate a scene corresponding
to the tuples. Briefly, it sets up a Conditional Random
Field (CRF) model with a scoring function very similar to
$\Phi(I_i) + \Psi(I_i, S_i)$. It samples scenes from this model using
Iterated Conditional Modes with different initializations.
Details can be found in [46].

[Figure 4 content]
Descriptions and generated scenes:
1. Mike is eating a pizza. Jenny is playing soccer. A cat is eating a hot dog.
2. It is a sunny day. Mike is sitting with a pizza. Jenny is playing with a soccer ball.
Answers: Ground truth: Yes. Vision + Text: Yes. Text alone: Yes.

Figure 4. Scenes generated for an example VP question.

4.4. Answering Questions with Imagined Scenes 

Fill-in-the-blank. For FITB, we generate one scene us¬ 
ing each question-answer pair Sij = {qi, Oij). Fig. 3 shows 
qualitative examples of scenes generated for FITB. From 
the question-answer pair Sij and the generated scenes lij , 
we extract features corresponding to our scoring function 
(Equation 2) and use them to learn the ranking S VM (Equa¬ 
tion 1) to answer EITB questions. We choose the ranking 
SVM C parameter using 5 fold cross validation. 

Visual paraphrasing. For VP, we generate one scene for each description,
$S_{i1} = q_{i1}$ and $S_{i2} = q_{i2}$, in the input pair of descriptions.
Fig. 4 shows qualitative examples of scenes generated for VP. We capture the
difference between the two sentence descriptions by pairing each generated
scene with the other description, i.e. we compute $\Omega(I_{i1}, S_{i2})$
and $\Omega(I_{i2}, S_{i1})$ (Equation 2). We extract features for both
combinations and concatenate the sum of the features with the absolute
difference of the features to make the mapping symmetric. These features are
used to train a binary SVM that determines whether the input pair of
descriptions describes the same scene. We choose the SVM C parameter using
5-fold cross-validation.
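The symmetric combination can be sketched as follows. The function name and the plain-list feature vectors are illustrative assumptions; the only property being demonstrated is that swapping the two descriptions leaves the combined feature vector unchanged.

```python
def symmetric_features(f12, f21):
    """Concatenate the element-wise sum and absolute difference.

    f12: features from pairing scene 1 with description 2.
    f21: features from pairing scene 2 with description 1.
    Both operations are order-invariant, so the result does not depend on
    which description is listed first.
    """
    if len(f12) != len(f21):
        raise ValueError("feature vectors must have the same length")
    added = [a + b for a, b in zip(f12, f21)]
    abs_diff = [abs(a - b) for a, b in zip(f12, f21)]
    return added + abs_diff
```

The output has twice the dimensionality of each input vector.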


5. Experiments and Results 
5.1. Fill-in-the-blank 

We present results of our approach on the FITB dataset in Table 1. Our
approach of “imagining” and joint visual-text reasoning achieves 48.04%
accuracy, significantly outperforming the text-only baseline (44.97%) by
3.07 percentage points using only 22 extra feature dimensions (compared to
the 2,486 dimensions of the baseline). This brings the performance closer to
human performance at 52.87%. Leveraging visual common sense does help in
answering these seemingly purely text-based questions.

Approach                             Fill-in-the-blank Accuracy (%)
Random                               25.00
Text baseline                        44.97
Visual                               33.67
Text + visual (presence)             47.02
Text + visual (attribute)            46.39
Text + visual (spatial)              44.80
Text + visual (presence, attribute)  48.60
Text + visual (all)                  48.04
Human Mode                           52.87

Table 1. Fill-in-the-blank performance of different approaches.

By breaking down our 22 parameters (corresponding to visual features) into
object presence (4D), attribute (5D) and spatial configuration (13D)
categories, we study their individual contribution to FITB performance on
top of the text baseline. Object presence contributes the most (47.02%),
followed by attribute (46.39%), while spatial information does not help
(44.80%). In fact, using only presence and attribute features achieves
48.60%, slightly higher than using all three (including spatial). Visual
features alone perform poorly (33.67%), which is expected given the textual
nature of the task. But they clearly provide useful complementary
information over text. In fact, text alone (baseline), vision+text (our
approach) and humans all seem to make complementary errors. Between text
alone and vision+text, 54.68% of the questions are correctly answered by at
least one of them. And between text alone, vision+text and human, 75.92% of
the questions are correctly answered by at least one of them.
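The "answered correctly by at least one of them" figures correspond to a simple union over per-question correctness. A small sketch, with made-up correctness flags rather than the paper's data:

```python
def accuracy(correct):
    """Percentage of questions flagged as correct."""
    return 100.0 * sum(correct) / len(correct)

def at_least_one_correct(*methods):
    """Per-question flag: True if any of the given methods is correct.

    Each argument is a list of booleans, one per question, in the same order.
    """
    return [any(flags) for flags in zip(*methods)]

# Illustrative flags only (four questions, two methods).
text_only = [True, False, False, True]
vision_text = [True, True, False, False]
union = at_least_one_correct(text_only, vision_text)
```

When the union accuracy is clearly higher than either method alone, the two methods fail on different questions, which is the sense in which the errors are complementary.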


Our model is capable of imagining scenes that may contain more objects than
the ones mentioned in the text. Our model using only presence features
achieves 47.02%, while a model agnostic to visual common sense that only
infers objects mentioned in the tuples ($a_i$ and $b_i$) achieves 46.62%.
This further demonstrates the need for imagination based on visual common
sense, rather than treating the text at face value.

In addition to predicting ground truth, we also study how well our approach
can mimic human responses. Our approach matches the human majority vote
(mode) response 39.35% of the time (text alone: 36.40%). When re-trained
using the human mode as the labels, the performance increases to 45.43%,
while the text-only baseline achieves 42.25%. These results suggest that
mimicking humans is a more challenging task (text-only was at 44.97% when
training on and predicting ground truth). Note that visual common sense is
also useful when mimicking humans.

Approach                             Visual Paraphrasing Average Precision (%)
Random                               33.33
Text baseline                        94.15
Visual                               91.25
Text + visual (presence)             95.08
Text + visual (attribute)            94.54
Text + visual (spatial)              94.75
Text + visual (presence, attribute)  95.47
Text + visual (all)                  95.55
Human Average                        94.78

Table 2. Visual paraphrasing performance of different approaches.

We also study how the performance of our approach varies based on the
difficulty of the questions. We consider questions to be easy if humans
agree on the response. We report performance of the text baseline and our
model on subsets of the FITB test set where at least K people agreed with
the mode. Fig. 5 shows performance as we vary K. On questions with higher
human agreement, the visual approach outperforms the baseline by a larger
margin. Qualitative results can be found in the supplementary material.

[Figure 5: plot of FITB performance for Text+Visual and Text against human
agreement.]

Figure 5. FITB performance on subsets of the test data with varying amounts
of human agreement. The margin of improvement of our approach over the
baseline increases from 3% on all questions to 6% on questions with high
human agreement.
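The agreement-based analysis can be sketched as below. The dictionary keys (`responses`, `truth`, `prediction`) are hypothetical, and accuracy is measured against ground truth as an assumption about what "performance" means on each subset.

```python
from collections import Counter

def mode_and_agreement(responses):
    """Most common human response and how many people gave it."""
    (answer, count), = Counter(responses).most_common(1)
    return answer, count

def accuracy_by_agreement(questions, k):
    """Accuracy on questions where at least k people agreed with the mode.

    Returns None if no question meets the agreement threshold.
    """
    subset = [q for q in questions
              if mode_and_agreement(q["responses"])[1] >= k]
    if not subset:
        return None
    hits = sum(q["prediction"] == q["truth"] for q in subset)
    return 100.0 * hits / len(subset)
```

Sweeping `k` from low to high values and calling `accuracy_by_agreement` once per method gives the kind of curves summarized in Fig. 5.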

5.2. Visual Paraphrasing

We present results of our approach on the VP dataset in Table 2. Our
approach of generating and reasoning with scenes does 1.4 percentage points
better than reasoning only with text (95.55% vs. 94.15% average precision).
In this task, the performance of the text-based approach is already close to
human, while vision pushes it even further to above human performance
(likely due to noise on MTurk).

Similar to the FITB task, we break down the contribution of visual features
into object presence, attribute and spatial configuration categories.
Presence contributes the most (0.93%). Spatial configuration features also
help (by 0.60%), in contrast to FITB. See Table 2.

In VP, a naive scene generation model that only imagines objects that are
mentioned in the description achieves 95.01%, which is close to the 95.08%
obtained when extra objects are inferred. We hypothesize that the VP task is
qualitatively different from FITB. In VP, important objects that are
relevant to the semantic distance between sentences tend to be mentioned in
the sentences. What remains is to reason about the attributes and spatial
configurations of the objects. In FITB, on the other hand, inferring the
unwritten objects is critical to identifying the best way to complete the
description. The VP task can be made more challenging by sampling pairs of
descriptions that describe semantically similar scenes. In fact, the
Abstract Scenes dataset contains groups of semantically similar scenes [45].
Exploring this is part of future work. Some qualitative results can be found
in the supplementary material.

We would like to stress that FITB and VP are purely textual tasks as far as
the input modality is concerned. The visual cues that we incorporate are
entirely “imagined”. Our results clearly demonstrate that a machine that
imagines and uses visual common sense performs better at these tasks than a
machine that does not.

6. Discussion


Leveraging visual knowledge to solve non-visual tasks may seem
counter-intuitive. Indeed, with sufficient training data, one may be able to
learn a sufficiently rich text-based model. However, in practice, good
intermediate representations provide benefits. This is the role that parts
and attributes have played in recognition [28, 14, 42]. In this work, the
imagined scenes form this intermediate representation that allows us to
encode visual common sense.

In this work, we choose clipart scenes as our modality to “imagine” the
scene and harness the power of visual common sense. This is analogous to
works on physical reasoning that use physics to simulate physical processes
[21]. These are both qualitatively different from traditional knowledge
bases [8, 44], where relations between instances are explicitly represented
and used during inference. Humans cannot always verbalize their reasoning
process. Hence, using non-explicit representations of common sense has some
appeal. Of course, alternate approaches, including more explicit
representations of visual common sense, are worth investigating.

Improved scene generation models that better translate from text to vision,
and better features and modalities to use the generated scenes to answer
non-visual questions, could also show improvements. In our experiments we
already show that a better scene generation model that infers objects beyond
what the text mentions shows better performance. Instead of generating one
image per text description, one could consider generating multiple diverse
images to better capture the underlying distribution [3]. With more visual
data, one can also expect to learn more sophisticated joint text-image
representations. Our scoring function is akin to a Conditional Random Field
model, similar to the scene generation model [46]. One could envision
learning the scene generation model and visual common sense models jointly,
i.e. learning to infer scenes for the FITB or VP tasks. The generated scenes
capture a semantically rich space. It would be interesting to study other
tasks that can benefit from this intermediate representation.
Don't Just Listen, Use Your Imagination:
Leveraging Visual Common Sense for 
Non-Visual Tasks 


Supplemental Material 


Coarse and Fine-grained Visual Paraphrasing 

• The 10,020 scenes in the Abstract Scenes Dataset are generated from 1,002
sentences. For each of the 1,002 sentences, 10 different people each drew a
scene (10 scenes per sentence). A new set of workers then described each of
the resulting scenes (10,020 total).

• Scenes that are generated from the same sentence belong to the same 
semantic class, and therefore their sentence descriptions have similar 
semantic meanings. 

• We study coarse-grained and fine-grained visual paraphrasing problems. 

• In the coarse-grained visual paraphrasing problem, the objective is to tell sentences 
describing one semantic class from another. 

• In the fine-grained visual paraphrasing problem, the objective is to tell
apart sentences describing different scenes within the same semantic class.
Coarse and Fine-grained Visual Paraphrasing

• In both coarse- and fine-grained settings, visual features show improvements on
top of the text-only baseline.

Setting                   Source of positive pairs of sentences        Source of negative pairs of sentences        Random  Text only  Text + Visual  Improvement
Original (in main paper)  Same scene                                   Different scenes                             33.33%  94.15%     95.55%         +1.40%
Coarse-grained            Different scenes in the same semantic class  Scenes from different semantic classes       33.33%  84.19%     86.15%         +1.96%
Fine-grained              Same scene                                   Different scenes in the same semantic class  33.33%  54.79%     56.43%         +1.64%
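The three settings differ only in where positive and negative sentence pairs come from. A minimal sketch of that rule, assuming each description is stored as a hypothetical `(text, scene_id, class_id)` triple; it enumerates all pairs exhaustively and does not reproduce whatever sampling ratio was actually used:

```python
from itertools import combinations

def build_pairs(descriptions, setting):
    """Split all description pairs into positives and negatives.

    descriptions: list of (text, scene_id, class_id) triples.
    setting: "original", "coarse" or "fine".
    """
    if setting not in ("original", "coarse", "fine"):
        raise ValueError("unknown setting: " + setting)
    pos, neg = [], []
    for (t1, s1, c1), (t2, s2, c2) in combinations(descriptions, 2):
        same_scene, same_class = s1 == s2, c1 == c2
        if setting == "original":
            (pos if same_scene else neg).append((t1, t2))
        elif setting == "coarse":
            if same_class and not same_scene:
                pos.append((t1, t2))
            elif not same_class:
                neg.append((t1, t2))
        else:  # fine
            if same_scene:
                pos.append((t1, t2))
            elif same_class:
                neg.append((t1, t2))
    return pos, neg
```

In the fine-grained setting every negative pair shares a semantic class with the positives, which is consistent with it being the hardest of the three settings in the table.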


Qualitative Results: Fill-in-the-blank

• Scenario 1: human, text baseline and our approach are all correct.

Question
Mike kicked the soocer ball.
The duck is afraid of the soccer ball

A. Jenny and mike are angry at the dog.
B. The bear has a hamburger and drink.
C. The grill is next to the tree.
D. Jenny wants the soccer ball.

Answers
Ground Truth: D
Human: D (8/10)
Text baseline: D
Vision + text: D

Original Scene: [image]
Qualitative Results: Fill-in-the-blank

• Scenario 1: human, text baseline and our approach are all correct.

Question
Jenny is standing on the swing.
Mike is feeling sad.

A. The dog is standing next to the table.
B. The sun is behind the tree.
C. Jenny is angry because it is raining on her.
D. Jenny is near balloons

Answers
Ground Truth: B
Human: B (5/10)
Text baseline: B
Vision + text: B

Original Scene: [image]



Qualitative Results: Fill-in-the-blank

• Scenario 2: human and our approach are correct while text baseline is incorrect

Question
Jenny is in the sandbox
The cat and Jenny have not left room for Mike

A. Mike sees a pie.
B. The cat is sitting next to Jenny.
C. Mike and Jenny are sitting next a fire
D. Jenny is playing in the sandbox.

Answers
Ground Truth: B
Human: B (9/10)
Text baseline: C
Vision + text: B
Qualitative Results: Fill-in-the-blank

• Scenario 2: human and our approach are correct while text baseline is incorrect

Question
Mike and Jenny are scared of the duck.
Happy duck walks away.

A. Mike was wearing his crown in the sandbox.
B. The ball hits the duck.

Answers
Ground Truth: B
Human: B (5/10)
Text baseline: A
Vision + text: B




Qualitative Results: Fill-in-the-blank

• Scenario 3: human and text baseline are correct while our approach is incorrect

Question
Jenny is petting the cat.
No one is on the riding toy.

A. There is an apple tree behind Mike.
B. There are 3 hot dogs on the grill.
C. Mike is on the slide.
D. Jenny is happy to see Mike.

Answers
Ground Truth: C
Human: C (8/10)
Text baseline: C
Vision + text: A

Original Scene: [image]
Qualitative Results: Fill-in-the-blank

• Scenario 3: human and text baseline are correct while our approach is incorrect

Question
The burger is on the table.
Jenny is standing next to table.

B. The dog is watching Jenny.
C. Jenny threw the frisbee.
D. Mike is standing next to table.

Answers
Ground Truth: D
Human: D (4/10)
Text baseline: D
Vision + text: B




Qualitative Results: Fill-in-the-blank

• Scenario 4: human is correct while text baseline and our approach are incorrect

Question
Jenny is holding a pink pail.
Mike threw the beach ball.

A. Mike is sitting next to the tree.
B. There are three hamburgers on the grill.
C. A rocket ship is flying in the sky.

Answers
Ground Truth: D
Human: D (7/10)
Text baseline: C
Vision + text: A

Original Scene: [image]
Qualitative Results: Fill-in-the-blank

• Scenario 4: human is correct while text baseline and our approach are incorrect

Question
Jenny and Mike are fighting.
They are both wearing silly hats

A. Mike is holding a beach ball
C. The dog is watching Mike.
D. Jenny kicked the football.

Answers
Ground Truth: A
Human: A (5/10)
Text baseline: D
Vision + text: D

Original Scene: [image]


Qualitative Results: Fill-in-the-blank

• Scenario 5: our approach and text baseline are correct while human is incorrect

Question
The duck is near the soccer ball.
Jenny is sitting near the slide.

A. Mike is standing under the hot air balloon
B. Mike is sitting next to the dog.
C. The snake is sliding behind Mike.
D. Mike is very surprised.

Answers
Ground Truth: A
Human: B (8/10)
Text baseline: A
Vision + text: A

Original Scene: [image]
Qualitative Results: Fill-in-the-blank

• Scenario 5: our approach and text baseline are correct while human is incorrect

Question
Mike is holding the ball.
Mike is playing with the cat.

A. Mike is wearing sun glasses.
B. Jenny is sitting next to her juice.
C. The bear is roaring angrily.
D. The duck is in the sandbox.

Answers
Ground Truth: A
Human: B (4/10)
Text baseline: A
Vision + text: A




Qualitative Results: Fill-in-the-blank

• Scenario 6: our approach is correct while human and text baseline are incorrect

Question
Mike is wearing a hat.
Jenny is holding the pizza.

A. Jenny is trying to catch the soccer ball
B. Mike is holding the shovel.
C. Mike and Jenny are happy.
D. Mike is sitting on the grass.

Answers
Ground Truth: D
Human: C (7/10)
Text baseline: B
Vision + text: D

Original Scene: [image]
5/5/2015


Qualitative Results: Fill-in-the-blank

• Scenario 6: our approach is correct while human and text baseline are incorrect

Question
Mike is sitting on the grass.
Jenny is standing by the table.

A. Mike is king for a day
B. Jenny is angry at Mike.
C. Jenny is holding a pizza.
D. Mike is wearing a viking hat.

Answers
Ground Truth: C
Human: D (5/10)
Text baseline: D
Vision + text: C




Qualitative Results: Fill-in-the-blank

• Scenario 7: text baseline is correct while human and our approach are incorrect

Question
Jenny is jumping up and down.
Mike is holding a frisbee.

A. Mike is wearing his viking hat.
B. Mike and Jenny are camping
C. The rocket is soaring in the sky.
D. Jenny told the bear to leave.

Answers
Ground Truth: A
Human: B (7/10)
Text baseline: A
Vision + text: B

Original Scene: [image]
Qualitative Results: Fill-in-the-blank

• Scenario 7: text baseline is correct while human and our approach are incorrect

Question
Mike is playing in the sandbox.
Jenny wants to play with Mike.

A. Red apples grow on the tree.
B. Mike is near jenny.
C. The sun is shining on Mike and Jenny.
D. The pink shovel is on Jenny's lap.

Answers
Ground Truth: C
Human: D (4/10)
Text baseline: C
Vision + text: D

Original Scene: [image]



Qualitative Results: Fill-in-the-blank

• Scenario 8: human, text baseline and our approach are all incorrect

Question
Jenny is wearing a crown waving her hand.
The airplane is flying towards a giant cloud.

A. Mike is wearing a pirate hat.
B. Mike is near the swings.
C. Mike has a baseball bat.
D. Mike is happily kicking the soccer ball.

Answers
Ground Truth: D
Human: A (9/10)
Text baseline: A
Vision + text: A

Original Scene: [image]
Qualitative Results: Fill-in-the-blank

• Scenario 8: human, text baseline and our approach are all incorrect

Question
Jenny is upset she lost her balloons.
Jenny is standing next to the cat.

A. The airplane will not disturb them.
B. Mike is angry that the dog is not listening.
D. Jenny is afraid the rocket will hit the balloon

Answers
Ground Truth: D
Human: C (4/10)
Text baseline: B
Vision + text: B

Original Scene: [image]



Qualitative Results: Visual Paraphrasing

• Scenario 1: human, text baseline and our approach are all correct.

[Original scene(s) and generated scenes: images]

Descriptions
1. The bucket is in the sandbox. Mike runs to the ball. Mike is wearing a baseball cap.
2. The bucket is in the sandbox. Mike runs to the ball. Mike is wearing a baseball cap.

Answers
Ground truth: Yes
Human: 1.3753
Text baseline: 1.221
Vision + Text: 2.0805
Qualitative Results: Visual Paraphrasing

• Scenario 1: human, text baseline and our approach are all correct.

[Original scene(s), descriptions and generated scenes: not recoverable from the extracted text]

Answers
Ground truth: Yes
Human: 4.2825
Text baseline: 1.9647
Vision + Text: 2.1077


Qualitative Results: Visual Paraphrasing

• Scenario 1: human, text baseline and our approach are all correct.

[Original scene(s) and generated scenes: images]

Descriptions
1. Mike is holding a hot dog Jenny is carring ketchup. Jenny is running.
2. Mike and Jenny are standing on the picnic table. Mike and Jenny are afraid of the bear. The owl is standing on the beach ball.

Answers
Ground truth: No
Human: -3.0058
Text baseline: -2.2792
Vision + Text: -2.5399
Qualitative Results: Visual Paraphrasing

• Scenario 1: human, text baseline and our approach are all correct.

[Original scene(s) and generated scenes: images]

Descriptions
1. The bucket is in the sandbox. Mike runs to the ball. Mike is wearing a baseball cap.
2. The bucket is in the sandbox. Mike runs to the ball. Mike is wearing a baseball cap.

Answers
Ground truth: No
Human: -3.0058
Text baseline: -1.0911
Vision + Text: -1.3115


Qualitative Results: Visual Paraphrasing

• Scenario 2: human and our approach are correct while text baseline is incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Mike is angry because Jenny won't play. Jenny is crying because Mike is mean. The owl watches the two children argue.
2. The helicopter is flying above Jenny. Mike wants Jenny's Frisbee. Jenny is crying because Mike is mad.

Answers
Ground truth: Yes
Human: 1.3753
Text baseline: -0.1311
Vision + Text: 0.2123
Qualitative Results: Visual Paraphrasing

• Scenario 2: human and our approach are correct while text baseline is incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. It is raining on the tent. Jenny is sitting on the ground. Mike is very mad.
2. Jenny is sitting n the grass. Mike is angry with a dog. There is a burger on the grill

Answers
Ground truth: Yes
Human: 2.7909
Text baseline: -0.1274
Vision + Text: 0.2949


Qualitative Results: Visual Paraphrasing

• Scenario 2: human and our approach are correct while text baseline is incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. A lightening bolt flashes in the sky. Jenny is wearing a crown. Mike is shouting at Jenny.
2. Jenny is singing on the swingset. Mike is happy to see Jenny at the park. The hot air ballon is high in the sky.

Answers
Ground truth: No
Human: -3.0058
Text baseline: 0.2635
Vision + Text: -0.2044
Qualitative Results: Visual Paraphrasing

• Scenario 2: human and our approach are correct while text baseline is incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Jenny is running from a snake. Mike is chasing after the snake. It is raining on Jenny.
2. Jenny and Mike are afraid of the snake. Jenny is playing with a bat. Mike is jumping up.

Answers
Ground truth: No
Human: -3.0058
Text baseline: 0.1347
Vision + Text: -0.5795


Qualitative Results: Visual Paraphrasing

• Scenario 3: human and text baseline are correct while our approach is incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Mike and Jenny are having a barbecue. Jenny is excited to see a dog. Mike is angry at the dog for begging.
2. Jenny is sitting on the ground. Mike does not like his hamburger. The dog is wearing a blue collar

Answers
Ground truth: Yes
Human: 1.3753
Text baseline: 0.3909
Vision + Text: -0.1280
Qualitative Results: Visual Paraphrasing

• Scenario 3: human and text baseline are correct while our approach is incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. The cool dog is wearing sunglasses. The cat is jealous of the dog. Mike and Jenny play on the slide.
2. Mr. Dog is cool in sunglasses. Mike bumps into Jenny. Jenny is surprised by Mr. Dog.

Answers
Ground truth: Yes
Human: 1.3753
Text baseline: 0.0509
Vision + Text: -0.6838


Qualitative Results: Visual Paraphrasing

• Scenario 3: human and text baseline are correct while our approach is incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. It is raining on Jenny. Mike wants Jenny's lunch. Jenny is giving Mike her wet lunch.
2. Jenny has a blue cap. Mike has a viking helmet. There are 2 trees.

Answers
Ground truth: No
Human: -1.5452
Text baseline: -0.0278
Vision + Text: 0.2061
Qualitative Results: Visual Paraphrasing

• Scenario 3: human and text baseline are correct while our approach is incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Jenny wears sunglasses Mike catches the football jenny is wearing a witch's hat
2. Mike is kicking the ball. Jenny wants to catch the ball. Jenny is smiling at Mike.

Answers
Ground truth: No
Human: -1.5452
Text baseline: -0.6850
Vision + Text: 0.1486


Qualitative Results: Visual Paraphrasing

• Scenario 4: human is correct while text baseline and our approach are incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Mike is shooing the dag away. Jenny is waiting for a hamburger. The balloon flies over the playground.
2. Mike is cooking the burger. The dog is standing next to the pit. Jenny is sitting in the grass.

Answers
Ground truth: Yes
Human: 4.2825
Text baseline: -0.1836
Vision + Text: -0.3634
Qualitative Results: Visual Paraphrasing

• Scenario 4: human is correct while text baseline and our approach are incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Mike is wearing a beanie cap. The dog wants to eat the hamburger. Jenny is happy to see Mike.
2. Mike is wearing a funny hat Jenny is laughing at Mike's hat Jenny is sitting next to the table

Answers
Ground truth: Yes
Human: 2.7909
Text baseline: -0.4538
Vision + Text: -0.4682


Qualitative Results: Visual Paraphrasing

• Scenario 4: human is correct while text baseline and our approach are incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Jenny stood next to the fire. The dog watched the hamburgers on the grill. Mike flew into the sky with the mustard on his shirt.
2. Mike is near a grill. A dog is near jenny, there are three hot-dogs on the grill.

Answers
Ground truth: No
Human: -1.5452
Text baseline: 1.7038
Vision + Text: 1.2092
Qualitative Results: Visual Paraphrasing

• Scenario 4: human is correct while text baseline and our approach are incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Mike is wearing a blue cap. Jenny is wearing a sunglasses. Jenny and Mike are playing catch.
2. Mike is wearing a funny hat. Jenny is jumping off the ground. Mike is scared of something.

Answers
Ground truth: No
Human: -1.5452
Text baseline: 0.5427
Vision + Text: 0.2067


Qualitative Results: Visual Paraphrasing

• Scenario 5: our approach and text baseline are correct while human is incorrect

[Original scene(s), descriptions and generated scenes: not recoverable from the extracted text]

Answers
Ground truth: Yes
Human: -1.5452
Text baseline: 0.5894
Vision + Text: 0.6304
Qualitative Results: Visual Paraphrasing

• Scenario 5: our approach and text baseline are correct while human is incorrect

[Original scene(s), descriptions and generated scenes: not recoverable from the extracted text]

Answers
Ground truth: Yes
Human: -1.5452
Text baseline: 0.8277
Vision + Text: 1.2425


Qualitative Results: Visual Paraphrasing

• Scenario 5: our approach and text baseline are correct while human is incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Jenny is upset. Jenny doesn't like cats. The dog will cheer Jenny up.
2. Jenny is crying by the cat and dog. Jenny is holding her hands out to the animals. There are balloons in the background.

Answers
Ground truth: No
Human: 1.3753
Text baseline: -0.0449
Vision + Text: -0.2418
Qualitative Results: Visual Paraphrasing

• Scenario 5: our approach and text baseline are correct while human is incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Mike is wearing a hat. The bear is roaring at Mike. Mike is in front of a tree.
2. Mike is wearing a pirate hat. Jenny is wearing a crown. Jenny is holding her drink.

Answers
Ground truth: No
Human: 1.3753
Text baseline: -1.1950
Vision + Text: -1.1451


Qualitative Results: Visual Paraphrasing

• Scenario 6: our approach is correct while human and text baseline are incorrect

[Original scene(s), descriptions and generated scenes: not recoverable from the extracted text]

Answers
Ground truth: Yes
Human: -1.5452
Text baseline: -0.0771
Vision + Text: 0.6696
Qualitative Results: Visual Paraphrasing

• Scenario 6: our approach is correct while human and text baseline are incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Mike and Jenny are sitting on the ground. Two balls are on the ground. Mike is next to the slide.
2. Jenny is sitting in the grass. Mike is wearing a vikings hat. Jenny is very surprised.

Answers
Ground truth: Yes
Human: -3.0058
Text baseline: -0.0863
Vision + Text: 0.1524


Qualitative Results: Visual Paraphrasing

• Scenario 6: our approach is correct while human and text baseline are incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Mike is wearing a pirate hat. Jenny is wearing a funny hat. A dog is looking for something in the grass.
2. There is a rocket in the sky. Mike and Jenny are sitting on the ground. There is a dog in front of Mike and Jenny.

Answers
Ground truth: No
Human: 1.3753
Text baseline: 0.2037
Vision + Text: -0.0009




Qualitative Results: Visual Paraphrasing

• Scenario 6: our approach is correct while human and text baseline are incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. There's a pie on the table Jenny is wearing purple sunglasses Mike is beside the grill
2. Mike put the hamburger onto the grill. Jenny was excited the hamburger was almost done. Mike cooked both hamburgers and hotdogs.

Answers
Ground truth: No
Human: 1.3753
Text baseline: 0.3193
Vision + Text: -0.1845


Qualitative Results: Visual Paraphrasing

• Scenario 7: text baseline is correct while human and our approach are incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Mike is holding a hot dog Jenny is carring ketchup. Jenny is running.
2. Mike is very happy. Jenny is very happy. A dog is near a tree.

Answers
Ground truth: Yes
Human: -1.5452
Text baseline: 0.6291
Vision + Text: -0.1716
Qualitative Results: Visual Paraphrasing

• Scenario 7: text baseline is correct while human and our approach are incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Rain is falling from the cloud. The dog is standing in front of Mike. Mike is wearing sunglasses.
2. Jenny is waving to Mike. Mike has a soda pop. It is raining today.

Answers
Ground truth: Yes
Human: -1.5452
Text baseline: 0.0348
Vision + Text: -0.0688


Qualitative Results: Visual Paraphrasing

• Scenario 7: text baseline is correct while human and our approach are incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. The dog is on the table. Mike has a hamburger. Jenny has a drink.
2. The plane is flying low. Mike likes hamburgers with ketchup. Jenny is laughing at Mike's joke.

Answers
Ground truth: No
Human: 1.3753
Text baseline: -0.3248
Vision + Text: 0.1170
Qualitative Results: Visual Paraphrasing

• Scenario 7: text baseline is correct while human and our approach are incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Lightning is coming out of the cloud. Mike and Jenny are angry. Mike is playing with a beach ball.
2. Mike and Jenny run away. Mike and Jenny are scared of lightening. Lightening is in the sky.

Answers
Ground truth: No
Human: 1.3753
Text baseline: -0.0142
Vision + Text: 0.8637


Qualitative Results: Visual Paraphrasing

• Scenario 8: human, text baseline and our approach are all incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Mike is throwing the frisbee. Jenny is throwing the ball. The dog is standing next to the tree.
2. A dog has a baseball Jenny is running Mike is smiling

Answers
Ground truth: Yes
Human: -1.5452
Text baseline: -0.0217
Vision + Text: -0.3078
Qualitative Results: Visual Paraphrasing

• Scenario 8: human, text baseline and our approach are all incorrect

[Original scene(s), descriptions and generated scenes: not recoverable from the extracted text]

Answers
Ground truth: Yes
Human: -3.0058
Text baseline: -0.6132
Vision + Text: -0.3347


Qualitative Results: Visual Paraphrasing

• Scenario 8: human, text baseline and our approach are all incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Mike and Jenny play on the swings. The dog watches Mike on the swing. The tall tree looks pretty.
2. Jenny is playing on the swing. The dog is standing next to mike. Mike is holding a burger.

Answers
Ground truth: No
Human: 1.3753
Text baseline: 1.1652
Vision + Text: 1.0543
Qualitative Results: Visual Paraphrasing

• Scenario 8: human, text baseline and our approach are all incorrect

[Original scene(s) and generated scenes: images]

Descriptions
1. Jenny is kicking a ball. Jenny is wearing sunglasses. Mike is smiling.
2. It is a sunny day. Mike is sitting with a pizza. Jenny is playing with a soccer ball.

Answers
Ground truth: No
Human: 4.2825
Text baseline: 0.0234
Vision + Text: 0.1555